(54) [Title of the invention] Method and computer
system for detecting error strings in a text
(57) [Abstract]

[Object] To provide a method and computer system
for detecting or correcting error strings in a text.

[Construction] The present invention relates to a
method and computer system for detecting or correcting
error strings Fi in a text stored on a computer system.
In Step 10 of this method, an error-free string Si is
first selected. In Step 11, a possible error string fij
is generated by modifying the error-free string Si.
This string is used in calculating the value aij in
Step 12. The value aij expresses the probability that
the possible error string fij constitutes an actual
error string Fi. Consequently, if the value aij exceeds
a threshold value β, this result is output in Step 14.
Otherwise, Step 15 is executed, after which a further
possible error string fij is generated in Step 11.

10 selection of Si
11 generation of fij
12 calculation of aij
13 to 15: no
13 to 14: yes

[Claims]

[Claim 1] A method for detecting or correcting an
error string Fi in a text using a computer system,
comprising:

a step of, in order to detect or correct an error
string, employing the frequency H(Si) of the
corresponding error-free string Si in said text, and
thereby arranging that the error-free string Si appears
in said text; and

a step of storing said text in said computer
system.

[Claim 2] The method as claimed in claim 1 which
includes:

(a) a step of modifying an error-free string Si in
accordance with a rule Rj such that a possible error
string fij is generated;

(b) a step of determining the frequency H(fij) of
the string fij in said text;

(c) a step of comparing the frequencies H(fij) and
H(Si); and

(d) a step of deciding whether a possible error
string fij is an actual error string Fi, based on the
comparison of Step (c).

[Claim 3] The method as claimed in claim 2 wherein
a step of simulating, by suitable selection of a rule
Rj, a source of possible errors that is psychological
or is technically related to the computer system that
is employed, is included in Step (a) of claim 2.

[Claim 4] The method as claimed in claim 2 or 3
wherein, when comparing the frequency H(fij) of a
possible error string fij with the frequency H(Si) of an
error-free string Si, a step of conducting an evaluation
in accordance with a computing rule applied to the
values H(fij) and H(Si):

[math 1]

\[ \Phi_{ij}\bigl(H(f_{ij}), H(S_i)\bigr) = a_{ij} \tag{1} \]

is included in Step (c) of claim 2, and a step of
then comparing the value aij with the threshold value β
is included in Step (d) of claim 2.

[Claim 5] The method as claimed in claim 4 which
includes a step of defining the computing rule Φij as

[math 2]

\[ \Phi_{ij}\bigl(H(f_{ij}), H(S_i)\bigr) = \left( \frac{H(S_i)}{H(f_{ij})} \cdot \Psi \right)^{k} \tag{2} \]

where Ψ is a factor and k is an exponent which is 1
or -1.

[Claim 6] The method as claimed in claim 5 wherein,
when L(Si) is the number of characters in the error-free
string Si and p is an exponent equal to 2 or 3, the
factor Ψ is defined as

[math 3]

\[ \Psi = [L(S_i)]^{p} \]

[Claim 7] The method as claimed in claim 4, 5 or 6
wherein, in said computer system, a dictionary-based
method is implemented that is employed for the
determination of valid strings Gi, and including a step
of, for a possible error string fij having a frequency
H(fij) that is greater than 0, using said
dictionary-based method to determine whether the string
fij is a valid string, and, if a possible error string
fij is a valid string Gi, modifying the value aij of the
possible error string fij.

[Claim 8] The method according to any of claims 4
to 7 which, in said computer system, includes a step of
implementing a method for automated learning by
assigning to a rule Rj a variable factor δj (B) and
using that factor to modify the value aij of a possible
error string fij generated by applying the rule Rj in
step (a) of claim 2.

[Claim 9] A method for detecting or correcting
error strings Fi in a text, which includes:

(a) a step of detecting the frequencies H(Zi) of
different strings Zi in the text and defining those
strings Zi having a frequency H(Zi) exceeding a
threshold value γ as error-free strings Si; and

(b) a step of detecting or correcting an error
string Fi associated with an error-free string Si in
accordance with the method of any of claims 1 to 8.

[Claim 10] The method as claimed in claim 9 which
includes, in step (b) of claim 2, a step of sorting and
storing strings Zi by their corresponding frequencies
H(Zi) in said computer system and conducting a binary
search of said sorted strings Zi to determine the
frequency H(Zi).

[Claim 11] The method as claimed in claim 10 which
includes a step of executing a hashing method of
sorting the strings Zi in accordance with their
corresponding frequencies H(Zi), or using a tree
construction constituting a binary tree or linked
tries.

[Claim 12] The method according to any one of
claims 9 to 11 which includes a step of calculating the
corresponding values aij for various possible error
strings fij of various error-free strings Si, and
automatically replacing those possible error strings fij
that are error strings Fi according to the decision in
step (d) of claim 2 by said stored corresponding
error-free strings Si in the text.

[Claim 13] The method as claimed in claim 12 which
includes (a) a step of sorting the various possible
error strings fij according to their corresponding
values aij, and

(b) a step of selecting a criterion such that
only those possible error strings fij that satisfy the
criterion of the values aij are employed in step (d) of
claim 2, and thereby using said criterion as a
threshold value β.

[Claim 14] A computer system whereby an error
string Fi in a text is detected or corrected, thereby
causing the corresponding error-free string Si to appear
in said text, comprising:

first storage means for storing said text;
second storage means for storing the frequency H(Si)
of said error-free string Si; and

processor means using the frequency H(Si) of the
error-free string Si in detecting or correcting the
error string Fi.

[Claim 15] The computer system as claimed in claim
14 which comprises:

third storage means that stores the frequency H(fij)
of a possible error string fij and fourth storage means
that stores a rule Rj, and comprising:

means for modifying the error-free string Si
according to the rule Rj whereby said processor can
generate a possible error string fij;

means for determining the frequency H(fij) of a
possible error string fij;

means for comparing the frequencies H(Si) and
H(fij); and

means for associating a possible error string fij
with the error string Fi based on the output signal from
said comparison means.

[Claim 16] The computer system as claimed in claim
15 wherein:

said comparison means provided in said processor
means comprises:

calculation means that calculates the value aij in
accordance with a calculation rule:

[math 4]

\[ \Phi_{ij}\bigl(H(f_{ij}), H(S_i)\bigr) = a_{ij} \tag{1} \]

said output signal transmits the value aij; and
said associating means comprises means that stores
a threshold value β for comparison with the value aij.

[Claim 17] The computer system according to any one
of claims 14 to 16 which comprises:

means for determining the frequency H(Zi) of
different strings Zi in said text;

fifth storage means for storing the frequency H(Zi);
means for storing a threshold value γ, and
comparison means for comparing the threshold value
γ with a frequency H(Zi), whereby those strings Zi
having a frequency H(Zi) exceeding the threshold value γ
are defined to be error-free strings Si.

[Claim 18] A character recognition system 
comprising an automated optical character recognition 
system and a computer system according to any one of 
claims 14 to 17 wherein

said automated optical character recognition system
generates raw text for detection or correction of an
error string Fi and inputs said raw text into said
computer system.

[Claim 19] An automated dictation recording system
comprising a speech recognition system and a computer
system according to any one of claims 14 to 17 wherein

said speech recognition system generates raw text
for detection or correction of an error string Fi and
inputs said raw text into said computer system.

[Detailed description of the invention] 
[0001] 

[Field of industrial application] The invention
relates to a method and computer system for detecting
or correcting an error string in a text.
[0002] 

[Prior art] In known word processing systems,
entered text is stored separately from a dictionary. The
dictionary associated with a word processing system is
a file that contains a reasonably complete list of
known words and, if possible, their inflected forms,
i.e., their conjugations and forms deviating from the
standard. When searching for errors in a text, each
individual word is looked up in the dictionary. If a
word is not contained in the dictionary, the word
processing system issues an error message and asks the
user to check the word. Such systems have been
disclosed, for example, in U.S. Pat. Nos. 4,689,678,
4,671,684, and 4,777,617.

[0003] A word processing system has also been
disclosed in U.S. Pat. No. 4,674,065, which is based on
a statistical N-gram analysis technique. When an
incorrect word is detected, the user is offered a list
of possible correct alternatives to select from.

[0004] An overview of known techniques for
automated correction of words in a text is provided by
the publication "Techniques for Automatically
Correcting Words in Text," by Karen Kukich, ACM
Computing Surveys, Volume 24, No. 4, December 1992.

[0005] The known methods for error detection and
correction share the characteristic that a dictionary
separate from the text is used as the standard for
comparison. The known systems thus require a relatively
large amount of memory for storing the dictionary, and
this memory is not available to other applications.

[0006] A further disadvantage of using a dictionary
is that, in general, the dictionary itself contains
some errors and thus cannot be relied upon as a
standard. After all, the dictionary itself cannot be
checked for errors by the word processing system, since
the dictionary is considered to be the most reliable
standard. Moreover, the dictionary must be continually
updated, allowing additional errors to creep in. Known
word processing systems are practically unsuitable for
checking multilingual texts, since all "foreign words"
not present in the dictionary will be flagged as
errors. The same holds true for monolingual texts which
employ unusual or newly coined words, as well as for
computer code or texts that contain phonetic
information or formatting controls. In these cases,
known word processing systems may flag a large number
of correct strings as incorrect, since the strings do
not occur in the dictionary. This problem is especially
evident when the text being checked includes
abbreviations or formulas or contains proper names that
are not stored in the dictionary.
[0007] 

[Problem that the invention is intended to solve]
The invention is thus based on the object of providing
an improved method and computer system for detecting or
correcting an error string in a text.
[0008] 

[Means for solving the problem] The object of the
invention is achieved by a method for detecting or
correcting an error string Fi in a text using a computer
system, comprising: a step of, in order to detect or
correct an error string, employing the frequency H(Si)
of the corresponding error-free string Si in said text,
and thereby arranging that the error-free string Si
appears in said text; and a step of storing said text
in said computer system.

[0009] Also, the object of the invention is
achieved by a computer system, in particular a word
processing system, wherein an error string Fi in a text
is detected or corrected, thereby causing the
corresponding error-free string Si to appear in said
text, comprising: first storage means for storing said
text, second storage means for storing the frequency
H(Si) of said error-free string Si, and processor means
using the frequency H(Si) of the error-free string Si in
detecting or correcting the error string Fi.



[0010] With the invention, the storing of a 
dictionary is not required, so the disadvantages 
previously described for the prior art systems are 
largely eliminated. In contrast to known word 
processing systems, according to the invention, the 
text is not checked with respect to a dictionary but is 
rather itself subjected to a statistical analysis which 
serves as the basis for error detection. Here, external 
dictionaries are not required. The frequency of the 
error-free string, given by the user, in the text forms 
the basis for detecting error variants of the string. 
The frequency of the error-free string serves as a 
measure for the probability that a possible error 
string in the text is an actual error string 
corresponding to the error-free string. The error 
string identified in this manner, if it occurs more 
than once in the text, can then be replaced 
automatically by the corresponding error-free string 
throughout the entire text. 

[0011] In one embodiment of the invention, the 
error-free string specified by the user and occurring 
in the text is modified according to at least one rule, 
so that one or more possible error strings are 
generated. In deciding whether a possible error string 
actually corresponds to an error-free string given by 
the user, the frequency of the possible error string in 
the text is determined. The frequencies of the error- 
free string and the possible error string are compared, 
and this comparison forms the basis for deciding 
whether the possible error string is an actual error 
string. The comparison of the frequencies uses the fact 
that a word occurring frequently in a text has, with 
high probability, been entered incorrectly at least 
once. Thus, the larger the ratio of the frequency of 
the error-free string to the frequency of the possible 
error string, the higher the probability that the 
possible error string is an actual error string. 
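The frequency comparison described above can be sketched in a few lines. This is a minimal illustration only: the function names `string_frequencies` and `frequency_ratio` and the word-based tokenization are assumptions made for the example, not part of the described method.

```python
import re
from collections import Counter


def string_frequencies(text):
    """Count how often each string occurs in the text.

    Assumption: strings are runs of word characters; the source does not
    prescribe a tokenization.
    """
    return Counter(re.findall(r"\w+", text))


def frequency_ratio(freq, error_free, candidate):
    """Return H(Si) / H(fij), or None if the candidate never occurs."""
    h_candidate = freq[candidate]
    if h_candidate == 0:
        return None
    return freq[error_free] / h_candidate


text = "Basketball " * 6 + "Baskettball Basketball"
freq = string_frequencies(text)
ratio = frequency_ratio(freq, "Basketball", "Baskettball")  # 7 / 1
```

A larger ratio indicates, in the sense of the paragraph above, a higher probability that the rarer string is an entry error of the frequent one.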

[0012] To increase the effectiveness of this search
for error strings in the text, in accordance with a
preferred embodiment, the rule or rules used to modify
the error-free string are selected such that
psychological errors or error sources related to the
computer system, in particular to its keyboard, or both
of these, are simulated. A keyboard-related error, for
example, is pressing a key adjacent to the desired
character. If, for example, due to the keyboard used,
the character "b" frequently occurs in place of its
neighbor "v", this can be taken into account in a
corresponding rule. By applying this rule, a "v"
occurring in the error-free string is replaced with
"b", so that, from the error-free string, a possible
error string is generated. This possible error string
also occurs in the text with high probability. For any
single error-free string, this procedure can be
repeated using different rules to simulate different
possible errors.

[0013] The probability that applying a specific
rule will generate a possible error string that
actually occurs in the text can, depending on the rule,
vary with the user, the computer system used, or both.
This probability can be subject to time-related
variations, for example, because the user has learned
to avoid certain kinds of errors, because a new user
takes over and tends to make other kinds of errors, or
because the computer system used is replaced with
another, having another keyboard. This can be taken
into consideration using a method of automated learning
that registers the success probabilities of the rules
employed. If the automated learning process shows that
a rule often leads to detecting an error string, this
rule will be given preference and weighted with a
factor. An initialization of these factors can also be
determined using a training sequence.
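One conceivable way to register the success probabilities of the rules is sketched below. The source does not specify how the factor is computed; the class name `RuleStats` and the smoothed success rate used as the weight are assumptions chosen purely for illustration.

```python
class RuleStats:
    """Track how often each rule led to a confirmed error string.

    Hypothetical helper: the weighting formula is an assumption, not the
    method defined in the source.
    """

    def __init__(self):
        self.trials = {}
        self.hits = {}

    def record(self, rule_name, found_error):
        """Register one application of a rule and whether it succeeded."""
        self.trials[rule_name] = self.trials.get(rule_name, 0) + 1
        if found_error:
            self.hits[rule_name] = self.hits.get(rule_name, 0) + 1

    def weight(self, rule_name):
        """Smoothed success rate; an unseen rule starts at 0.5."""
        hits = self.hits.get(rule_name, 0)
        trials = self.trials.get(rule_name, 0)
        return (hits + 1) / (trials + 2)


stats = RuleStats()
for outcome in (True, True, True, False):
    stats.record("R1", outcome)
factor_r1 = stats.weight("R1")  # (3 + 1) / (4 + 2)
```

Feeding such a tracker with a training sequence before use would correspond to the initialization of the factors mentioned above.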

[0014] In accordance with a further preferred
embodiment, the entire text is automatically checked.
In this case, the frequencies of all unique strings in
the text are first determined. The strings whose
frequencies are higher than a specified threshold value
are defined as error-free strings, since a string
occurring very often in a text has a high probability
of being correct. The error-free strings so defined, or
their frequencies, then serve as the basis for error
detection.
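The selection of error-free strings by a frequency threshold can be sketched as follows. The function name `error_free_strings`, the parameter name `gamma`, and the threshold of 10 are assumptions for the example; the frequencies are taken from the example tables of this document.

```python
from collections import Counter


def error_free_strings(freq, gamma):
    """Return the strings whose frequency exceeds the threshold gamma."""
    return {s for s, h in freq.items() if h > gamma}


# Frequencies as given in the example tables (Table 4 and Table 8).
freq = Counter({"Basketball": 179, "Baskettball": 2, "Spvgg": 142, "Spvvg": 4})
trusted = error_free_strings(freq, gamma=10)
```

With this illustrative threshold, only "Basketball" and "Spvgg" are treated as error-free; the rare variants remain candidates for error detection.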

[0015] In accordance with a further preferred 
embodiment, the invention relates to a character 
recognition system comprising a system for automated 
optical character recognition. The system for automated 
optical character recognition can, for example, be used 
to enter a printed text into a computer system. In this 
case, the raw text input to the computer system for the 
automated optical character recognition process is not 
error-free. The generation of errors can result from 
the fact that the printed text contains errors or that 
the system for automated optical character recognition 
does not function without errors. The raw text entered 
into the computer system is checked by the computer 
system for errors in accordance with the invention, so 
that in particular deficiencies in the system for 
automated optical character recognition can be 
corrected over a wide range. A method based on an
N-gram technique for supporting an apparatus for
character recognition is disclosed in U.S. Pat. No.
4,058,795.

[0016] In accordance with a further preferred 
embodiment, the invention relates to a system for 
automated recording of dictation, comprising a speech 
recognition system. Such speech recognition systems 
have been disclosed, for example, in U.S. Pat. Nos.
4,783,803; 4,741,036; 4,718,094; and 4,164,025. 

[0017] The speech recognition system generates a 
raw text, generally exhibiting errors, which is entered 
into a computer system. The error detection or 
correction provided by the invention is applied using 
such a computer system. 

[0018] In accordance with a further preferred
embodiment, the invention relates to a storage medium
suited for use in a programmable computer system.
Through a physical or chemical process, a program is
recorded on the storage medium for carrying out the
inventive method. Through this physical or chemical
process, the storage medium acquires the characteristic
of being able to interact with the programmable
computer system such that said computer system, or a
conventional computer system programmable for general
purposes, is transformed into a computer system
according to the invention.
[0019] 

[Embodiments] The block diagram shown in FIG. 1
refers, for example, to a word processing system
according to the present invention, into which a text
to be checked has already been entered. In step 10, the
user can select an error-free string Si occurring in the
text. The object of the inventive method is then to
detect at least one error string Fi in the text
corresponding to the selected error-free string Si,
i.e., representing, for example, a typographical error
when compared with the error-free string Si.

[0020] Next, in step 11, a possible error string fij
is generated. The possible error string fij is produced
from the error-free string Si by applying rule Rj. In
step 11, by applying rule Rj, where possible, to
different letters or letter positions, multiple
possible error strings fij are generated from the
error-free string Si.

[0021] In step 12, a value aij is calculated by
comparing the frequency H(Si) of the error-free
string Si and the frequency H(fij) of the possible error
string fij.

[0022] In step 13, the value aij calculated in step
12 is compared with a threshold value β. If aij>β, the
search result is declared in step 14 to be that the
possible error string fij is equal to an actual error
string Fi. This result can be used for automated
correction of all strings Fi occurring in the text.
Prior to this correction, the results so obtained can
be presented to the user for verification. In this
case, the automated correction is carried out only if
the user agrees with the proposed result.
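The automated correction with optional user verification can be sketched as follows. The function name `replace_error_string`, the `confirm` callback, and the use of word boundaries to delimit strings are assumptions for the example.

```python
import re


def replace_error_string(text, error_string, error_free,
                         confirm=lambda found, proposal: True):
    """Replace every occurrence of error_string by error_free.

    `confirm` stands in for the user verification step; by default every
    proposal is accepted. Whole-word matching is an assumption.
    """
    if not confirm(error_string, error_free):
        return text
    pattern = r"\b" + re.escape(error_string) + r"\b"
    # A function is used as replacement so that the error-free string is
    # inserted literally, without backslash interpretation.
    return re.sub(pattern, lambda match: error_free, text)


corrected = replace_error_string(
    "Baskettball und Baskettball", "Baskettball", "Basketball")
```

Passing a `confirm` callback that asks the user and returns `False` on rejection leaves the text unchanged, which mirrors the verification step described above.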
[0023] If the condition aij>β is not true, the index
j is incremented by 1 in step 15. The result of this is
that in the next step 11 for generating another
possible error string, another rule Rj+1 is applied. The
additional possible error string fi,j+1 so generated
represents an additional candidate, which could
correspond to an error string Fi. This determination is
again made in the subsequent steps 12 and 13, and, if
applicable, the result is declared in step 14.

[0024] According to the flow diagram in FIG. 1, the
method is then terminated as soon as an error string Fi
is determined in step 14. It is also possible, however,
that in this case too, additional possible error
strings fij are formed by applying other rules Rj. This
corresponds to the steps 15 and 11 described above. In
this way, still more error strings Fi can be found
that, for example, have arisen through other entry
errors with respect to the error-free string Si
selected in step 10.

[0025] In this case, it is also possible that,
initially, in several sequential executions of step 14,
different error strings Fi are defined as results of
the search and these error strings are presented to the
user sorted by the corresponding aij values. Since the
aij values represent a measure of the probability that a
possible error string fij is an actual error string Fi
occurring in the text, the results are therefore shown
to the user sorted by their probability.
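The loop over steps 11 to 15, with the results sorted by their aij values, might look like the following sketch. The two rule functions, the name `find_error_strings`, the exponent p = 3, the skipping of candidates that do not occur in the text, and the frequency of the invented variant "Baksetball" are all assumptions for illustration.

```python
from collections import Counter


def transpose(s):
    """Transposition of two successive letters."""
    return {s[:i] + s[i + 1] + s[i] + s[i + 2:] for i in range(len(s) - 1)} - {s}


def double_letter(s):
    """Doubling of a letter."""
    return {s[:i] + s[i] + s[i:] for i in range(len(s))}


def find_error_strings(freq, error_free, rules, beta, p=3):
    """Apply each rule in turn and collect candidates with aij > beta."""
    found = {}
    h_s = freq[error_free]
    psi = len(error_free) ** p
    for rule in rules:                          # step 15: next rule
        for candidate in rule(error_free):      # step 11: generate fij
            h_f = freq.get(candidate, 0)
            if h_f == 0 or candidate in found:
                continue                        # assumption: skip absent strings
            a = h_s / h_f * psi                 # step 12: value aij (k = 1)
            if a > beta:                        # step 13: compare with beta
                found[candidate] = a            # step 14: record result
    return sorted(found.items(), key=lambda item: item[1], reverse=True)


freq = Counter({"Basketball": 179, "Baskettball": 2, "Baksetball": 1})
results = find_error_strings(freq, "Basketball", [transpose, double_letter], beta=1000)
```

Here `results` lists "Baksetball" before "Baskettball", since the rarer variant yields the larger value.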

[0026] In contrast to known dictionary or N-gram
based systems, the basis for the error detection is not
externally stored data - such as in the form of a
separately stored dictionary - but rather the text
itself being checked. In accordance with the invention,
the otherwise externally stored data is derived from
the text being checked by determining the frequency
H(Si). If the frequency H(Si) assumes high values, the
invention leads to the conclusion that a possible error
string fij occurring seldom in the text represents an
actual error string Fi. In this case, externally stored
data and the attendant expenditure are not required.

[0027] The rules Rj employed in step 11 for
generating the possible error strings fij are preferably
selected such that psychological errors and/or other
error sources related to the computer system, in
particular to its keyboard, are simulated.
Psychological errors are those, for example, that are
difficult to find when copy editing, such as errors in
particularly long words. A keyboard-related error is,
for example, one caused by inadvertent key bounce,
producing a double letter. Inadvertent multiple entry
or omission of a character at the keyboard can also
occur if the keyboard exhibits a poorly defined
actuation point.

[0028] The calculation of the value aij in step 12
can be performed based on the computing rule of
equation (1) below.

[0029]

[math 5]

\[ \Phi_{ij}\bigl(H(f_{ij}), H(S_i)\bigr) = a_{ij} \tag{1} \]

This computing rule can preferably have the form

[0030]

[math 6]

\[ \Phi_{ij}\bigl(H(f_{ij}), H(S_i)\bigr) = \left( \frac{H(S_i)}{H(f_{ij})} \cdot \Psi \right)^{k} \tag{2} \]

where Φij is a function dependent on the frequency
H(fij) and the frequency H(Si), the value Ψ is a factor,
and the value k represents an exponent.

[0031] The factor Ψ can be calculated according to
the equation (3) below.

[0032]

[math 7]

\[ \Psi = [L(S_i)]^{p} \tag{3} \]

where the function L determines the length of the
string Si, or in other words, the number of characters
in string Si. The value p represents an exponent that
is preferably 2 or 3, i.e., quadratic or cubic.
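Equations (1) to (3) translate directly into code. The following is a minimal sketch; the function names `psi` and `alpha` and the default p = 3 are assumptions for the example, and the input values are taken from Table 4 and Table 11 of this document.

```python
def psi(error_free, p=3):
    """Factor from equation (3): string length raised to the power p."""
    return len(error_free) ** p


def alpha(h_error_free, h_candidate, error_free, p=3, k=1):
    """Value aij from equation (2); h_candidate must be positive."""
    if h_candidate <= 0:
        raise ValueError("the possible error string must occur in the text")
    return (h_error_free / h_candidate * psi(error_free, p)) ** k


a_basketball = alpha(179, 2, "Basketball")   # (179 / 2) * 10**3
a_christian = alpha(175, 1, "Christian")     # (175 / 1) * 9**3
```

Under these assumptions, the two results agree with the values 89500 and 127575 listed in the corresponding tables.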

[0033] The following formula is the quotient
contained in equation (2).

[0034]

[math 8]

\[ \frac{H(S_i)}{H(f_{ij})} \]

This quotient is the key element in computing the
value aij. The reason for this is that this quotient
increases with increasing frequency of the error-free
string Si and decreasing frequency of the possible error
string fij in the text. This quotient is based on the
experience that a string occurring with high frequency
in a text has a high probability of being correct, and
that furthermore the probability that the same string
also occurs in the text at least once with an error -
e.g., due to an entry error - increases with the
frequency of the error-free string in the text. The
correction factor Ψ can additionally take into account
that with increasing string length, the probability
that the string contains an error increases, in
particular because errors in long strings are
generally not easily recognized by the user.
Furthermore, the factor Ψ takes into consideration that
with increasing word length the probability decreases
that a modification of the error-free string Si using a
rule Rj will lead to another error-free string Si
occurring in the text. This has a particularly strong
influence on the calculation of the value aij; in this
case a value such as 2 or 3 is chosen for the exponent
p. The value k in the first embodiment of FIG. 1 has
the value 1. If this value is chosen as -1, only the
condition aij>β in step 13 need be replaced by aij<β. For
simplified representation, only the case k=1 will
hereafter be considered, without loss of generality.
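The relationship between the two exponent choices can be written out explicitly. The following is a sketch under the assumption that the threshold used with k = -1 is the reciprocal of the one used with k = 1; the symbol β' is introduced here only for illustration and does not appear in the source.

```latex
\[
a_{ij}^{(k=1)} = \frac{H(S_i)}{H(f_{ij})}\,\Psi > \beta
\quad\Longleftrightarrow\quad
a_{ij}^{(k=-1)} = \frac{H(f_{ij})}{H(S_i)\,\Psi} < \beta' = \frac{1}{\beta},
\qquad \beta > 0 .
\]
```

Since all quantities involved are positive, taking reciprocals reverses the inequality, which is why only the direction of the comparison in step 13 changes.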

[0035] The value aij calculated using equation (1)
thus increases with the probability that a possible
error string fij is an actual error string Fi. In step
13, a check is therefore made whether the result based
on comparing the frequencies H(Si) and H(fij) provides a
sufficient measure of certainty for the definition of a
result in step 14. The choice of the corresponding
threshold value β thereby depends on the requirements
of the user: a high threshold value means that the
result determined in step 14 is almost certainly
correct, while some possible error strings fij that
would also lead to a correct result are discarded in
step 13. The opposite is true if a low value is chosen
for the threshold value β.

[0036] The following tables, Table 1 to Table 19,
show several examples of possible rules Rj. Also, for
each rule an example is given with an error-free string
Si, the corresponding possible error string fij, and the
related value aij. Following the strings Si and fij,
their corresponding frequencies in the examined text
are given in parentheses. The text is from the sports
sections of the "Frankfurter Rundschau" newspaper for
1988.

[0037]

[Table 1]

Rule R1: Transposition of two successive letters.

[0038] Example:
f11 = "Olmypischen" (1)
S1 = "Olympischen" (875)
a11 = 1164625
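The tabulated value can be checked by hand. The following worked calculation assumes the computing rule of equation (2) with k = 1 and p = 3; the value p = 3 is not stated next to the table, but it reproduces the listed figure for the 11-character string "Olympischen".

```latex
\[
a_{11} = \frac{H(S_1)}{H(f_{11})}\,[L(S_1)]^{p}
       = \frac{875}{1}\cdot 11^{3}
       = 875 \cdot 1331
       = 1164625 .
\]
```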

[0039]

[Table 2]

Rule R2: Omission of a letter occurring at least
twice.

Example:
f22 = "Präsidumssitzung" (1)
S2 = "Präsidiumssitzung" (7)
a22 = 40824

[0040]

[Table 3]

Rule R3: Omission of a letter occurring at most
once.

[0041] Example:
f33 = "Diziplinen" (1)
S3 = "Disziplinen" (89)
a33 = 118549

[0042]

[Table 4]

Rule R4: Doubling of a letter.

[0043] Example:
f44 = "Baskettball" (2)
S4 = "Basketball" (179)
a44 = 89500

[0044]

[Table 5]

Rule R5: Replacement of a letter.

[0045] Example:
f55 = "Golopprennbahn" (1)
S5 = "Galopprennbahn" (34)
a55 = 93296

[0046]

[Table 6]

Rule R6: Insertion of a letter not
previously occurring in the word.

[0047] Example:
f66 = "Wiederanspfiff" (1)
S6 = "Wiederanpfiff" (47)
a66 = 103259

[0048]

[Table 7]

Rule R7: Insertion of a letter previously occurring
in the word.

Example:
f77 = "Ablöseseumme" (1)
S7 = "Ablösesumme" (91)
a77 = 157248

[0049]

[Table 8]

Rule R8: Incorrect doubling of a letter, here:
left-hand neighbor.

[0050] Example:
f88 = "Spvvg" (4)
S8 = "Spvgg" (142)
a88 = 4435

[0051]

[Table 9]

Rule R9: Incorrect doubling of a letter in a word,
here: right-hand neighbor.

[0052] Example:
f99 = "Sperrwerfen" (1)
S9 = "Speerwerfen" (19)
a99 = 25289

[0053]

[Table 10]

Rule R10: Instead of the desired letter, right-hand
neighbor was pressed.

[0054] Example:
f10,10 = "erfolgteich" (1)
S10 = "erfolgreich" (290)
a10,10 = 385990

[0055]

[Table 11]

Rule R11: In addition to the desired letter,
right-hand neighbor was pressed; insertion before
intended letter.

[0056] Example:
f11,11 = "Cjhristian" (1)
S11 = "Christian" (175)
a11,11 = 127575

[0057]

[Table 12]

Rule R12: In addition to the desired letter,
right-hand neighbor was pressed; insertion after
intended letter.

[0058] Example:
f12,12 = "Verletzunmg" (1)
S12 = "Verletzung" (153)
a12,12 = 153000

[0059]

[Table 13]

Rule R13: Instead of the desired letter, left-hand
neighbor was pressed.

[0060] Example:
f13,13 = "Problene" (1)
S13 = "Probleme" (290)
a13,13 = 148480
[0061]

[Table 14]

Rule R14: In addition to the desired letter,
left-hand neighbor was pressed; insertion before
intended letter.

Example:
f14,14 = "Hoffnungsträgwer" (1)
S14 = "Hoffnungsträger" (18)
a14,14 = 73728

[0062]

[Table 15]

Rule R15: In addition to the desired letter,
left-hand neighbor was pressed; insertion after
intended letter.

[0063] Example:
f15,15 = "Qualkifikation" (1)
S15 = "Qualifikation" (255)
a15,15 = 560235

[0064]

[Table 16]

Rule R16: Capitalization error on first letter.

[0065] Example:
f16,16 = "olympiastadion" (1)
S16 = "Olympiastadion" (5)
a16,16 = 13720

[0066]

[Table 17]

Rule R17: Capitalization error on second letter.

[0067] Example:
f17,17 = "SChwalbach" (1)
S17 = "Schwalbach" (38)
a17,17 = 38000

[0068]

[Table 18]

Rule R18: Omission of a double letter, leaving only
a single letter.

[0069] Example:
f18,18 = "Etapensieger" (1)
S18 = "Etappensieger" (37)
a18,18 = 81289

[0070]

[Table 19]

Rule R19: Doubling of a doubled letter, thus
tripling it.

[0071] Example:
f19,19 = "UdSSSR" (1)
S19 = "UdSSR" (740)
a19,19 = 92500

The rules Rj are optimally selected when essentially
only those variants that best correspond to the observed
error types are generated in step 11. In this respect,
the following rules have proven themselves: rule R1
(transposition of two successive letters: from "abcba":
"bacba", "acbba", "abbca", and "abcab"); rule R2
(omission of one letter, restricted to letters occurring
at least twice, i.e., excluding letters that occur only
once: for example, from "abcba": "bcba", "acba", "abca",
and "abcb"); and rule R7 (insertion of individual letters
that already appear in the string: for example, from
"abc": "aabc", "abac", "abca", "babc", "abbc", "abcb",
"cabc", "acbc", "abcc", but not "abdc" or the like).
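The three rules named above can be sketched in Python as follows. This is only an illustration of how step 11 might generate variants; the function names are hypothetical and do not come from the specification, and duplicates are dropped so that each variant is generated once.

```python
def transpositions(s):
    """Rule R1 (sketch): swap each pair of successive letters."""
    out = []
    for i in range(len(s) - 1):
        v = s[:i] + s[i + 1] + s[i] + s[i + 2:]
        if v != s and v not in out:
            out.append(v)
    return out


def omissions_of_repeated(s):
    """Rule R2 (sketch): omit one letter, but only letters occurring at least twice."""
    out = []
    for i, ch in enumerate(s):
        if s.count(ch) >= 2:
            v = s[:i] + s[i + 1:]
            if v not in out:
                out.append(v)
    return out


def insertions_of_seen(s):
    """Rule R7 (sketch): insert, at every position, a letter already present in s."""
    out = []
    for ch in dict.fromkeys(s):  # distinct letters, in order of first appearance
        for i in range(len(s) + 1):
            v = s[:i] + ch + s[i:]
            if v not in out:
                out.append(v)
    return out
```

Applied to "abcba" and "abc", these functions reproduce the variants listed in the text; "abdc" is never produced because "d" does not occur in "abc".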

[0072] Rule R2 serves primarily to simulate a
possible psychological error source. Omissions of
letters occur very easily during manual entry but are
more difficult to find during copy editing if the
omitted letter occurs again in the string, because it
is then not "missed" as much.

[0073] On the other hand, rules R10 to R15 serve to
simulate technical deficiencies of the entry method
employed, in this case a keyboard. In this example, the
technical deficiency of the keyboard lies in the
ergonomically unfavorable arrangement of the keys, so
that adjacent keys are frequently pressed by mistake.
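A minimal sketch of the left-hand-neighbor rules R13 to R15 follows. The keyboard rows (`ROWS`, a German QWERTZ letter layout) and the function name are assumptions made for illustration; the specification does not give a layout table, and a real implementation would use whatever layout the text was typed on.

```python
# Assumed QWERTZ letter rows; not taken from the specification.
ROWS = ["qwertzuiop", "asdfghjkl", "yxcvbnm"]

# Map each key to the key immediately to its left in the same row.
LEFT = {row[i]: row[i - 1] for row in ROWS for i in range(1, len(row))}


def left_neighbor_variants(s):
    """Return (rule, variant) pairs for rules R13, R14 and R15 (sketch)."""
    out = []
    for i, ch in enumerate(s):
        left = LEFT.get(ch.lower())  # case of the original letter is ignored
        if left is None:
            continue  # leftmost key of a row, or not a letter key
        out.append(("R13", s[:i] + left + s[i + 1:]))      # neighbor instead of letter
        out.append(("R14", s[:i] + left + s[i:]))          # neighbor before letter
        out.append(("R15", s[:i + 1] + left + s[i + 1:]))  # neighbor after letter
    return out
```

With this layout the sketch produces the example strings of Tables 13 to 15, such as "Problene" from "Probleme".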

[0074] A further possible rule is the replacement
of optically similar letters in the error-free string,
e.g., replacement of "c" by "e". In a word processing
system according to the invention, use of this rule can
simulate error sources arising from technical
deficiencies (such as insufficient resolution) of the
screen used to display the text. In a character
recognition system in accordance with the invention,
this and other rules can simulate technical
deficiencies of the system for automated optical
character recognition, since optically similar letters
are often not correctly recognized by such systems. In
the same way, in a system for automated recording of
dictation in accordance with the invention,
deficiencies in the associated speech recognition
system can be simulated. Applying the corresponding
rules, phonetically similar letters are interchanged,
e.g., "m" with "n", since speech recognition systems
often produce such errors. Of course, the rules
mentioned can apply not only to words but also to
strings of any composition.

[0075] In calculating the value aij in step 12, a
dictionary-based method can be used in addition. The
possible error string fij is then additionally checked
using the dictionary-based method. If the string fij is
contained in the dictionary, i.e., if it is a valid
string Gi, this would initially indicate that the
possible error string fij is not an error. However, this
is in no way certain, since an error in the
corresponding error-free string Si can by chance also
lead to a valid string Gi, i.e., the possible error
string fij can occur in the dictionary as a valid string
Gi in addition to being an error string Fi. Of course,
as noted, a certain probability exists that a possible
error string fij occurring as a valid string Gi in the
dictionary is not an actual error string Fi. This can be
taken into account in the calculation of aij by
modifying the value aij from equation (2) if the string
fij is a valid string Gi. The modification can be done
by multiplying the value from equation (2) by a factor
between 0 and 1. A factor of 0 signifies that a
valid string Gi is defined unconditionally as error-
free. In this case, however, an advantageous
characteristic of the inventive method would be lost,
namely consideration of the context. Using the
inventive method, the word "director" in a manual for
data processing was determined to be a possible error
variant of the word "directory", although the word
"director" is valid. Taking context into account is
implicit in the inventive method, since the frequencies
H(Si) and H(fij) are compared with each other. The
factor is therefore advantageously chosen to be
significantly greater than 0.

[0076] The calculation of the value aij in step 12
can be further influenced by a method for automated
learning. The method for automated learning assigns a
factor δj(B) to an applied rule Rj. The factor δj(B)
is a variable and can be influenced on the one hand by
the user and on the other hand by the type of hardware
used. If the application of a rule Rj leads with
above-average frequency to finding an error in
the text, the method for automated learning assigns the
rule Rj a corresponding factor δj(B) greater than 1. In
the opposite case, the method for automated learning
assigns the rule a factor less than 1. The value aij
determined in step 12 from equation (2) is thus
additionally multiplied by the factor δj(B) associated
with the applied rule Rj, so that the different success
probabilities of the rules Rj are considered in the
calculation of aij. The rules Rj can be sorted according
to their factors δj(B) such that the rules Rj with the
highest probability of success, to which a relatively
large factor δj(B) is assigned, are applied first in
step 11. If the inventive method is conducted fully
automatically, i.e., without displaying the detected
errors as suggestions to the user, the determination in
step 14 is the key input for the method of automated
learning. If a suggestion is given to the user, the
user's acceptance of a string proposed as being in
error is the key input for the method of automated learning
and thus for determining the factors δj(B). The method
for automated learning can, for example, be implemented
with a neural network, possibly in conjunction with an
expert system.
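The weighting described in the two preceding paragraphs can be sketched as follows. The base value is whatever equation (2) yields; `dict_factor`, `step`, and the multiplicative update in `update_delta` are hypothetical choices made for this sketch only. The specification leaves the learning method open (it mentions a neural network), so the update rule here is a simple stand-in, not the described method.

```python
def weighted_score(base, rule, delta, in_dictionary=False, dict_factor=0.5):
    """Scale the equation-(2) value by the rule weight and, if the candidate
    is a valid dictionary string, by a factor between 0 and 1 (sketch)."""
    score = base * delta.get(rule, 1.0)
    if in_dictionary:
        score *= dict_factor  # assumed value; the text only asks for > 0
    return score


def update_delta(delta, rule, confirmed, step=0.1):
    """Hypothetical learning step: raise the weight of a rule whose
    suggestion was confirmed as an error, lower it otherwise."""
    factor = 1.0 + step if confirmed else 1.0 - step
    delta[rule] = delta.get(rule, 1.0) * factor
    return delta


def rules_by_priority(delta):
    """Order rules so that those with the largest weight are applied first."""
    return sorted(delta, key=delta.get, reverse=True)
```

Keeping the weights in a plain dictionary keyed by rule name makes it straightforward to store one such dictionary per user or per keyboard.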

[0077] By using a system for automated learning, a
user- and/or hardware-specific calibration can be
implemented. For example, the transposition of "y" and
"z", such as in "Szstem", can be expected only with
those users who continually switch between German and
American keyboards, but not with authors of newspaper
copy, who generally work with only one type of
keyboard. Since there are also corresponding word pairs
which do not represent errors, for example "Holy" and
"Holz", it is useful to consider such transpositions as
possible errors only if they are reasonable for the
application area. A hardware-related type of error that
can be allowed for through the method of automated learning
is, for example, the inadvertent simultaneous
depression of two adjacent keys on the keyboard, such
as in "Sysrtem". The probability of this type of error
will depend on the keyboard used, in particular its
action point and any generation of an acoustical signal
when pressing a key. Furthermore, the method of
automated learning can also allow for the use, prior to
the inventive method, of other spell-check methods
which detect certain error types only with difficulty.
Those rules Rj which simulate these error types are then
assigned a particularly heavy weighting via the factor
δj(B).

[0078] The user- and/or hardware-specific
calibration can also be obtained by direct entry of the
user- or hardware-specific weighting factors δj(B). The
factors δj(B) associated with a specific user, specific
hardware, or a specific combination of user and
hardware can then be stored in separate data sets. If
the user or hardware changes, the current set of
factors δj(B) is replaced by the set of factors δj(B)
associated with the new user or the new hardware, so
that the latter set becomes the current one. The
current set of factors δj(B) serves to weight the
values obtained from equation (2) in step 12. The value
aij is thus obtained by multiplying the value obtained
from equation (2) by the factor δj(B) associated with
the applied rule Rj. The current set of factors δj(B)
thus obtained can also serve as a set of initial values
for the factors δj(B) for the method of automated
learning, so that the method can start off with user-
or hardware-specific weighting factors δj(B), which can
then be further optimized automatically. If the user or
hardware changes, the optimized set of factors δj(B)
can be stored for later use as initial values.

[0079] In addition, it is beneficial to provide an
exception table, in which frequent word pairs such as
form/from or three/there are stored. Suitable proper
names can also be stored in this table, e.g.,
Helmut/Hellmut or Hausmann/Haussmann. Although either
member of such a pair could also arise from a
typographical error, these words are not regarded
as possible error strings in step 12. For a possible
error string fij generated in step 11, a check is made
whether this string fij is present in the exception
table. If so, the next step executed will be 15 rather
than 12.
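The exception check can be sketched as a set lookup. Storing unordered pairs, rather than single strings, is an assumption of this sketch: it suppresses "form" only as a variant of "from", not as a variant of every other word. The entries are the examples named in the text.

```python
# Exception table as a set of unordered pairs (assumed representation).
EXCEPTIONS = {
    frozenset(("form", "from")),
    frozenset(("three", "there")),
    frozenset(("Helmut", "Hellmut")),
    frozenset(("Hausmann", "Haussmann")),
}


def is_exception(error_free, candidate):
    """True if the pair is listed, so step 15 follows instead of step 12."""
    return frozenset((error_free, candidate)) in EXCEPTIONS
```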

[0080] FIG. 2 shows the flow diagram of a second
preferred embodiment of the invention. In step 20, the
frequency H(Zi) of each string Zi occurring in the text
is first determined. In this case, each unbroken
sequence of letters or any other characters, depending
on the application, can be defined as a string Zi.

[0081] In step 21, the occurring strings Zi and
their corresponding frequencies H(Zi) are stored
pairwise in a table. In step 22, the condition H(Zi)>γ
is tested. The value γ is a threshold value for the
frequency H(Zi), above which the corresponding string Zi
is defined to be an error-free string Si. If, therefore,
the frequency H(Zi) of a specific string Zi exceeds the
threshold value γ, this specific string Zi is defined as
an error-free string Si. The basis for this is that a
string occurring relatively often in a text is with
high probability an error-free string, i.e., a correctly
spelled word of the language used.

[0082] If the condition H(Zi)>γ in step 22 is not
satisfied, the next step executed is 23, in which the
index i is incremented by one. In the subsequent step
22, the condition H(Zi+1)>γ is tested for another
string.

[0083] If the condition H(Zi)>γ in step 22 is
satisfied by a string Zi, step 24 is executed next. In
step 24, the corresponding string Zi is defined as an
error-free string Si. The subsequent steps 11, 12, 13,
14, 15 correspond to the steps of the first embodiment
discussed with reference to FIG. 1. Step 24 thereby
performs the function of step 10 in the first
embodiment, namely the selection of a specific error-
free string Si. All possible variations previously
discussed with respect to the first embodiment are also
possible in the second embodiment.

[0084] After completing the search for error
strings Fi belonging to the string Si that was defined
as error-free in step 24, the condition i=imax is tested
in step 25. If index i has reached the maximum value
imax, all strings Zi occurring in the text have been
examined, so that the process is terminated in step 27.

[0085] If the condition i=imax is not yet satisfied,
the index i is incremented by 1 in step 26, and in step
22 the condition H(Zi+1)>γ is again checked for another
string Zi.

[0086] Step 12 for computing the value aij can
advantageously be carried out by obtaining the
frequency H(fij) from the table stored in step 21, so
that the calculation is accelerated. If a possible
error string fij is not present in the table, its
frequency is 0. In this case, step 15 can be executed
without further evaluation, so that another rule Rj can
be applied to generate another possible error string.
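The second embodiment can be sketched end to end as below. Two things are assumptions and not taken from this section: the tokenizing regular expression, and the `score` function standing in for equation (2). The stand-in, length of Si cubed times H(Si)/H(fij), was chosen only because it reproduces the example values in Tables 13 and 16 to 19 (for instance 8³·290/1 = 148480 for "Probleme"); the actual equation (2) is defined elsewhere in the specification and may differ. `generate` is any caller-supplied function that applies the rules Rj.

```python
import re
from collections import Counter


def score(s, h_s, h_f):
    """Assumed stand-in for equation (2): len(s)**3 * H(S) / H(f)."""
    return len(s) ** 3 * h_s / h_f


def find_error_strings(text, generate, gamma, beta):
    """Sketch of FIG. 2: return (a, candidate, error_free) sorted by a, descending."""
    # Steps 20/21: count every unbroken run of letters and keep the table.
    freq = Counter(re.findall(r"[^\W\d_]+", text))
    results = []
    for s, h_s in freq.items():
        if h_s <= gamma:            # step 22 fails: s is not taken as error-free
            continue
        for f in generate(s):       # step 11: possible error strings
            h_f = freq.get(f, 0)    # table lookup; absent strings have frequency 0
            if h_f == 0 or f == s:
                continue            # step 15: try the next variant
            a = score(s, h_s, h_f)  # step 12
            if a > beta:            # step 13
                results.append((a, f, s))  # step 14
    return sorted(results, reverse=True)
```

Returning the results sorted by the value of a means the most reliable suggestions come first, which suits both a review list for the user and fully automated correction with a large β.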

[0087] The result obtained in step 14 can be used
for automated correction, as previously discussed with
reference to FIG. 1. It can be beneficial, however, to
store all results obtained in step 14 and, after
executing step 27, sort them by the corresponding
values of aij. The user is then presented with a result
list from which the user can accept or reject
individual results for automated correction. Since the
list is sorted by the values aij, the most reliable
results are shown first. If the threshold value β was
selected relatively large, however, this procedure is
not necessary, since in general all the results
actually obtained in step 14 can be used, so that an
automated correction can take place immediately without
user intervention.

[0088] To limit the execution time of the process,
e.g., because only a certain amount of computing time
is available, the method can be terminated prematurely
if a defined number of errors have already been found
or a certain portion of computing time has been
expended. To accelerate the process, the generation of
possible error strings fij can be controlled such that
all rules Rj are applied only if the frequency H(Si) of
the error-free string Si associated with a possible
error string fij is high. In general, this expenditure
will be worthwhile only if the frequency H(Si) is very
high. A high frequency H(Si) implies a large statistical
sample, so that the reliability of the result in
step 14 increases. In the case of lower frequencies
H(Si), the set of rules Rj used for detecting an
error string Fi associated with an error-free string Si
can be limited accordingly, so that steps 11 to 15 are
executed faster overall.

[0089] If prior to executing step 22 the table
generated in step 21 is sorted, e.g., in alphabetical
order, further acceleration results. The search of the
table for a possible error string fij when calculating
the value aij in step 12 can then be carried out as a
binary search. The binary search method is well-known,
e.g., from Donald E. Knuth, "The Art of Computer
Programming," Vol. 3, Section 6.4.1, Algorithm B,
Addison-Wesley Publishing Company, 1973.
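A minimal sketch of the sorted-table lookup, using Python's standard `bisect` module for the binary search. The parallel-sequence layout (`sorted_strings`, `frequencies`) and the function name are assumptions of this sketch.

```python
from bisect import bisect_left


def lookup_frequency(sorted_strings, frequencies, candidate):
    """Binary search in the sorted table; strings not present have frequency 0."""
    i = bisect_left(sorted_strings, candidate)
    if i < len(sorted_strings) and sorted_strings[i] == candidate:
        return frequencies[i]
    return 0
```

Each lookup then costs on the order of log2(n) comparisons for a table of n distinct strings, instead of a linear scan.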

[0090] In FIG. 3, a further possibility for storing
the table generated in step 21 is shown. The tree
structure depicted in FIG. 3 is generally described as
a "linked trie" (see references such as Franklin Mark
Liang, "Word Hy-phen-a-tion by Com-put-er", Department
of Computer Science, Stanford University, August 1983,
pp. 11ff. and the references cited therein; de la
Briandais, Rene, "File searching using variable length
keys," Proc. Western Joint Computer Conf. 15, 1959, pp.
295-298; and Fredkin, Edward, "Trie memory," CACM 3,
September 1960, pp. 490-500). In this example, the tree
includes nodes 30, each node 30 having entries 31
through 34. Entry 31 contains a letter or symbol, entry
32 contains the frequency H(Zi) of the corresponding
string Zi, entry 33 is a pointer to a child, if present,
of node 30, and entry 34 is a pointer to a sibling,
if present, of node 30. Entry 32 in a node 30 is non-
zero if the string from the highest level of the tree
to the node 30 occurs in the text. An example is shown
in FIG. 3 on the basis of a text that contains only the
words "Festung", "Feuer", "Rauch", "Frieden", and
"Fest", whereby the word "Feuer" occurs twice and the
word "Fest" occurs three times in the text. The
remaining words each occur only once in the text.
[0091] Storing the table from step 21 in this way
has the advantage of requiring less storage space and
providing added acceleration to the process. The
structure of the "linked trie" can be generated in
parallel with determining the individual strings Zi and
their frequencies, so that subsequent sorting is
unnecessary. The applicable algorithm has been
specified by Knuth (reference: Donald E. Knuth, "The
Art of Computer Programming," Addison-Wesley Publishing
Company, 1973, Section 6.2.2, pp. 422 ff., in
particular Algorithm T).
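The node layout of FIG. 3 can be sketched as follows. The class and function names are hypothetical; the four attributes mirror entries 31 to 34. This is a plain child/sibling trie written for illustration, not a transcription of Knuth's Algorithm T.

```python
class Node:
    """One node 30 of the linked trie."""

    def __init__(self, letter):
        self.letter = letter  # entry 31: letter or symbol
        self.count = 0        # entry 32: frequency H(Zi), 0 if no string ends here
        self.child = None     # entry 33: pointer to first child
        self.sibling = None   # entry 34: pointer to next sibling


def insert(root, word):
    """Add one occurrence of word, creating nodes as needed."""
    node = root
    for ch in word:
        prev, cur = None, node.child
        while cur is not None and cur.letter != ch:  # walk the sibling chain
            prev, cur = cur, cur.sibling
        if cur is None:
            cur = Node(ch)
            if prev is None:
                node.child = cur
            else:
                prev.sibling = cur
        node = cur
    node.count += 1


def frequency(root, word):
    """Return H(word), or 0 if the word does not occur in the trie."""
    node = root
    for ch in word:
        cur = node.child
        while cur is not None and cur.letter != ch:
            cur = cur.sibling
        if cur is None:
            return 0
        node = cur
    return node.count
```

Because counts are updated as each word is read, the structure is complete as soon as the text has been scanned once, and shared prefixes such as "Fe" in "Fest", "Festung", and "Feuer" are stored only once.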

[0092] FIG. 4 shows an embodiment of a computer
system in accordance with the invention. The computer
system comprises storage means 1 for storing the text
to be checked; storage means 12 for storing the
frequencies H(Zi), or, in other words, for storing the
table or tree structure established in step 21 (see
FIG. 2 and FIG. 3); storage means 4 for storing rules Rj
used in step 11 for generating the possible error
strings fij (see FIG. 1 and FIG. 2); and processor means
2 for process control. The processor means 2 employs
the frequency H(Si) of the error-free string Si for
detecting the error string Fi. The storage means 1, 4, 12
and the processor means 2 are interconnected via a bus
16, so that the processor means can access the respective
storage means. In this embodiment, the processor means
2 comprises storage means 3 for storing a frequency
H(fij) needed to compute the value aij in step 12; means
5 for modifying an error-free string Si in accordance
with a rule Rj, whereby a possible error string fij can
be generated according to step 11; means 6 for
determining the frequency H(fij); means 7 for comparing
the frequencies H(fij) and H(Si); means 8 for
associating the possible error string fij with the error
string Fi; means 11 for determining the frequency H(Zi)
of various strings Zi in the text; and comparison means
13 for comparing the threshold value γ with the
frequency H(Zi). The means 3, 5, 6, 7, 8, 11, 13 are
interconnected via a processor-internal bus 16. The
means 3, 5, 6, 7, 8, 11, and 13 contained in the
processor means 2, as well as bus 16, need not be
discrete electronic components but can rather be
realized via appropriate programming of processor
means 2. Such a program suitable for implementing the
inventive method will interact with the control program
of the computer system in a well-known manner such that
the computer system assumes the configuration shown in
FIG. 4.

[0093] The means 6 for determining the frequency
H(fij) interacts via the bus 16 with means 12 such that
the desired frequency H(fij) can be derived from means
12, if this frequency is stored there. If there is no
entry for the possible error string fij in the table
stored in means 12, the frequency H(fij) is zero. The
determination of the frequency is needed to calculate
the value aij in step 12.

[0094] The means 7 for comparing the frequencies
H(Si) and H(fij) comprises computing means 9 for
computing the value aij in accordance with the following
computation rule.

[0095] 

[math 9] 

ψij(H(fij), H(Si)) = aij    (1)

This corresponds to the comparison of the
frequencies H(Si) and H(fij) carried out in step 12 in
computing the value aij.
[0096] 

The means 8 for associating the possible error
string fij with the error string Fi comprises means 10 for
storing the threshold value β for a comparison with the
value aij. The value aij determined for comparison by
means 7 is transferred via bus 16 to associating means
8. Associating means 8 processes the value aij in
accordance with steps 13 and 14.

[0097] The means 11 for determining the frequency
H(Zi) interacts with the means 1 to identify individual
strings Zi in the text and to calculate the
corresponding frequencies H(Zi), in accordance with step
20.

[0098] The comparison means 13 includes means 14
for storing the threshold value γ. The comparison means
13 interacts with means 11 to define those strings Zi as
error-free strings Si whose frequency H(Zi) exceeds the
threshold value γ. Using appropriate control by a
program 17, the computer system in accordance with the
invention can thereby carry out the procedure of FIG. 1
and FIG. 2. The program can be stored in the means 17
for program control, whereby the means 17 for program
control interacts with the processor means 2 via bus
16.

[0099] Using the computer system in accordance with
the invention, the sports sections of the "Frankfurter
Rundschau" newspaper for 1988 were examined. The
corresponding text consists of 1,671,136 words, of
which 77,745 are unique. The computer system
examined 5,849 possible error strings fij, of which 643
were actual error strings Fi. The rules Rj indicated in
Table 1 to Table 19 were applied, whereby the
application of rules R2 and R3 alone resulted in
detecting 295 different actual error strings Fi.
[0100]

[Beneficial effect of the invention] According to 
the present invention, detection or correction of error 
strings in text can be performed. 

[Brief description of the drawings] 
[Figure 1] is an outline flow chart of a first
embodiment of the present invention.

[Figure 2] is an outline flow chart of a second
embodiment of the present invention.

[Figure 3] is a view showing a storage structure
that is ideal for storing strings according to the
present invention.

[Figure 4] is a view showing a computer system 
according to the present invention. 



[Explanation of the reference symbols] 

1 storage means 

2 processing means 

3 storage means

4 storage means

6 means

7 means

8 means

9 calculation means

10 means 

11 means 

12 storage means 

13 comparison means

[Figure 1]

10 selection of Si

11 generation of fij

12 calculation of aij

[link between 13 and 15]: No
[link between 13 and 14]: Yes

[Figure 2]

20 calculation of H(Zi)

21 calculation of (Zi and H(Zi))
[link between 22 and 23]: No
[link between 22 and 24]: Yes

11 generation of fij

12 calculation of aij
[link between 13 and 14]: Yes
[link between 25 and 26]: No
[link between 25 and 27]: Yes



[Figure 4] 

1 text

5 means for modifying Si

6 means for determining H(fij)

7 means for comparing H(Si) and H(fij)

9 aij calculation means

8 means for associating fij and Fi

11 means for determining H(Zi)

13 comparison means

17 program



